This group assignment will be worth 20% of your total grade, and is marked out of a total of 126 marks.
This assignment is due at 8pm (Melbourne Time) on Monday, 20 September 2021.
For this assignment, everyone will need to upload the rendered HTML file into Moodle. Please make sure you add your name on the YAML part of this Rmd file. The HTML file will be used as your final answers to the assignment. (Remember that you need to submit 2 files: html and Rmd, please do not zip your files. NO ZIP!!!)
Keep all your work in the Rstudio cloud. We will check your project online via Rstudio cloud.
After the assignment is submitted, you will need to grade the effort and contributions of your team members. For this, we are going to use a Google doc form. I will release the instructions about this after you have submitted the assignment.
All the R code chunks should have echo and eval set to TRUE. The final knitted report should display both the code and the output of each code chunk. If a code chunk is causing error when knitting, then use echo = TRUE and eval = FALSE.
The group project must knit without any problems to produce a html report. Reports that are not knitted correctly will be awarded 0.
The data files are located inside the data folder in your assignment R projects.
To complete the assignment, you will need to fill in the blanks with the appropriate function names, arguments, or other names. These sections are marked with ___. At a minimum, your assignment should be able to be “knitted” using the Knit button for your Rmarkdown document.
The data that we are going to study in this assignment is about the Olympics games. The data span from the first Olympics held in Athens in 1986 to the 2016 Olympics held in Rio de Janeiro. We are going to run some modeling on this data. The purpose of this assignment is to show you the typical tasks performed by a data analyst from analyzing data to performing modeling to gain insights.
There are 2 data files in this assignment:
1. athlete_events.csv: The individual athlete competing in the Olympic events.
2. noc_regions.csv: National Olympic Committee with its country name.
The variables in athlete_events.csv are:
# all packages here
library(tidyverse)
library(rvest)
library(naniar)
library(plotly)
library(ggrepel)
library(readxl)
library(janitor)
library(polite)
library(ggResidpanel)
library(broom)
athlete_events data. We will also remove the 1906 Olympic Games which is not recognized by the International Olympic Committee. [3m]data_raw <- read_csv("data/athlete_events.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl (5): ID, Age, Height, Weight, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data_raw %>%
filter(Year != "1906") #2m
noc_regions.csv and notes column is not needed.] [4m]noc <- read_csv("data/noc_regions.csv")
## Rows: 230 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): NOC, region, notes
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_noc <- data %>%
left_join(noc %>% select(NOC, region),
by = "NOC")
head(data_noc)
## # A tibble: 6 × 16
## ID Name Sex Age Height Weight Team NOC Games Year Season City
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 A Diji… M 24 180 80 China CHN 1992 … 1992 Summer Barc…
## 2 2 A Lamu… M 23 170 60 China CHN 2012 … 2012 Summer Lond…
## 3 3 Gunnar… M 24 NA NA Denma… DEN 1920 … 1920 Summer Antw…
## 4 4 Edgar … M 34 NA NA Denma… DEN 1900 … 1900 Summer Paris
## 5 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## 6 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## # … with 4 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>
athlete_events data set. The output should have information on the City, Country, Year and Continent. [3m]hostlink <- read_html("https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities")
host <- hostlink %>%
html_nodes("table:nth-child(19)") %>% #2m
html_table() %>%
`[[` (1) %>% # to extract the first element from the list
select(-1) # delete first column as the column of City is duplicated.
head(host)
## # A tibble: 6 × 9
## City Country Year Continent Summer Winter `Opening ceremo… `Closing ceremo…
## <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 Athens Greece 1896 Europe S005I "" 6 April 1896 15 April 1896
## 2 Paris France 1900 Europe S005II "" 14 May 1900 28 October 1900
## 3 St. Louis[a] United… 1904 North Am… S005I… "" 1 July 1904 23 November 1904
## 4 London[b] United… 1908 Europe S005IV "" 27 April 1908 31 October 1908
## 5 Stockholm Sweden 1912 Europe S005V "" 6 July 1912 22 July 1912
## 6 Berlin Germany 1916 Europe S006VI "" Cancelled due t… Cancelled due t…
## # … with 1 more variable:
## # .mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref <chr>
host table with the data_noc table. [Note: some years might have 2 hosting cities)] [4m]data_complete <- data_noc %>%
left_join(host %>% select(Year, City, Country), # only select relevant columns
by = c("Year","City"))
head(data_complete)
## # A tibble: 6 × 17
## ID Name Sex Age Height Weight Team NOC Games Year Season City
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 A Diji… M 24 180 80 China CHN 1992 … 1992 Summer Barc…
## 2 2 A Lamu… M 23 170 60 China CHN 2012 … 2012 Summer Lond…
## 3 3 Gunnar… M 24 NA NA Denma… DEN 1920 … 1920 Summer Antw…
## 4 4 Edgar … M 34 NA NA Denma… DEN 1900 … 1900 Summer Paris
## 5 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## 6 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## # … with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>,
## # Country <chr>
Country. Find out which cities and the years have missing values of Country. [hint: The table should not have any duplicate values. You might want to look up some new function for this which we have not taught in class.] [3m]data_complete %>%
filter(is.na(Country)) %>% #2m
select(Year, City, Country) %>%
distinct() # the function that get rid of duplicated values
## # A tibble: 15 × 3
## Year City Country
## <dbl> <chr> <chr>
## 1 1920 Antwerpen <NA>
## 2 2006 Torino <NA>
## 3 2008 Beijing <NA>
## 4 2004 Athina <NA>
## 5 1960 Squaw Valley <NA>
## 6 1956 Melbourne <NA>
## 7 1960 Roma <NA>
## 8 1980 Moskva <NA>
## 9 1976 Innsbruck <NA>
## 10 1904 St. Louis <NA>
## 11 1928 Sankt Moritz <NA>
## 12 1908 London <NA>
## 13 1948 Sankt Moritz <NA>
## 14 1896 Athina <NA>
## 15 1956 Stockholm <NA>
Now that we know which City (and Year) have issues, we have to check the wiki table and make edits to our dataset. The city name from the wiki table contains some footnote numbering. I have identified those cases for you as below: (The left is from the wiki table, the right is from the ahtlete_events data)
Add an extra row for Year (1956), City (Stockholm), Country (Sweden).
Also, change the corresponding row of MelbourneStockholm’s host country to just Australia.
(Phew! it is a lot of works isn’t it, welcome to the real life of data analyst! 😓)
I have done it for you here, as below. Please try to understand the code below as they are very useful functions!
host_c <- host %>%
mutate(City = case_when(City == "Athens" ~ "Athina",
City == "St. Louis[a]" ~ "St. Louis",
City == "London[b]" ~ "London",
City == "Antwerp[c]" ~ "Antwerpen",
City == "St. Moritz" ~ "Sankt Moritz",
City == "MelbourneStockholm[f]" ~ "Melbourne",
City == "Rome" ~ "Roma",
City == "Innsbruck[g]" ~ "Innsbruck",
City == "Moscow" ~ "Moskva",
City == "Turin" ~ "Torino",
City == "Beijing[i]" ~ "Beijing",
City == "Palisades Tahoe, then called Squaw Valley" ~ "Squaw Valley",
TRUE ~ City)) %>%
add_row(City = "Stockholm", Year = 1956, Country = "Sweden") %>%
mutate_at("Country", str_replace, "Australia\\sSweden", "Australia" )
# the "\s" is because of regular expression of a space
Country variable. [6m]final_data <- data_noc %>%
left_join(host_c %>% select(Year, City, Country), # only select relevant columns
by = c("Year","City")) #2m
head(final_data)
## # A tibble: 6 × 17
## ID Name Sex Age Height Weight Team NOC Games Year Season City
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1 A Diji… M 24 180 80 China CHN 1992 … 1992 Summer Barc…
## 2 2 A Lamu… M 23 170 60 China CHN 2012 … 2012 Summer Lond…
## 3 3 Gunnar… M 24 NA NA Denma… DEN 1920 … 1920 Summer Antw…
## 4 4 Edgar … M 34 NA NA Denma… DEN 1900 … 1900 Summer Paris
## 5 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## 6 5 Christ… F 21 185 82 Nethe… NED 1988 … 1988 Winter Calg…
## # … with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>,
## # Country <chr>
any_na(final_data$Country)
## [1] FALSE
Now, the data is in good shape and we are ready to explore the data in more detail.
final_data? [Inline code is required. [1m +1m]pct_miss(final_data)
## [1] 7.843654
The percentage of missing values is 7.843654%.
gg_miss_upset(final_data)
Medal? Does it raise a concern? [Do not need inline coding.] [2m]There are 175591 observations missing only for ‘Medal’. These are high numbers for missing observations and so it does raise a concern of having these many missing observations, that too only for the Medal column. It will increase the percentage of missing values for the entire table, but the possible reason is it maybe due to no medal earned for the particular candidate and then lead to reduced in statistical power and representative of the samples.
Height and Weight only regardless of whether they have won a medal? [Do not need inline coding, but if you wish you can have it here.] [2m+1m]# coding here
pct_miss_HW <- final_data %>%
select(Height,Weight) %>%
miss_case_table()
The total percentage of missing for Height and Weight regardless of they have won the medal is 21.2745422%.
No, the missing is not at random. What year the “height” is missing, then that year the “Weight” seems to be missing as well.
Some way to treat missing value: 1. Deleting the observations 2. Deleting the variable 3. Imputation with mean / median / mode 4. Prediction of missing values.
For example,for “Weight”, we can estimate a model, then using the model to predict some values to fill in to replace NAs.
host_c data and construct a relevant plot (in descending order of the number of times countries hosted the Olympics). [4m]host_c %>%
count(Country)%>%
ggplot(aes(x = reorder(Country, -n),
y = n)) +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y = "Number of hosting",
x = "",
title = "Countries that hosted Olympics from 1956-2016")
Year, Summer and Winter. Sort the data with the highest value for Summer. [5m]final_data %>%
count(Year, Season) %>%
pivot_wider(id_cols = Year,
names_from = Season,
values_from = n) %>%
arrange(desc(Summer))
## # A tibble: 34 × 3
## Year Summer Winter
## <dbl> <int> <int>
## 1 2000 13821 NA
## 2 1996 13780 NA
## 3 2016 13688 NA
## 4 2008 13602 NA
## 5 2004 13443 NA
## 6 1992 12977 3436
## 7 2012 12920 NA
## 8 1988 12037 2639
## 9 1972 10304 1655
## 10 1984 9454 2134
## # … with 24 more rows
final_data %>%
count(Year, Season) %>% # no mark, same as question 2
ggplot(aes(x = Year,
y = n,
colour = Season)) +
geom_line() +
labs(y = "Number of Athletes") -> q12
q12
ggplotly(q12)
I observed that when focus on Summer games, there are three dips in 1932 with 2969 athletes, 1956 with 5127 athletes and 1980 with 7191 athletes.
geom_text to show the name of the host city of 3 games that you mentioned in the previous question in the graph (focus only on the Summer games). Find out why there are two Cities for holding the Olympic games. Write a small paragraph about it. [4m + 2m]dip <- final_data %>%
filter(Year %in% c(1932,1956,1980) & Season == "Summer") %>% #3m
count(Year, Season, City)
q12 +
geom_text(aes(label = City), colour = "blue", data = dip) # Ask among your group member why the colour argument is not placed inside the aesthetics?
From the graph shown, in 1956 there are two hosting cities Melbourne and Stockholm. This was the first time the Olympics would be held in the Southern Hemisphere and Oceania, and it also marked the first occasion that the Games were played outside of Europe and North America. However, in the lead up to the Games, there were a series of boycotts, political problems, and controversy. Australian equine quarantine laws prevented the equestrian events from taking place in Melbourne. Therefore, the equestrian events at the 1956 Summer Olympics were held in Stockholm and included dressage, eventing, and show jumping.
data %>%
group_by(Year, Season) %>%
summarise(n_event = length(unique(Event))) %>%
ggplot(aes(x = Year,
y = n_event,
colour = Season)) +
geom_point() +
geom_line() + # show the trend
labs(y = "Number of events",
title = "Increasing number of events in the Olympics games") +
theme_minimal()
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
From here onwards, we will focus on just the Summer games.
final_data2 <- final_data %>%
filter(Season == "Summer")
medals_won <- final_data2 %>%
filter(region == "Australia") %>%
select(Games, Year, Medal, Sport, Event, Country, City) %>%
distinct() %>%
group_by(Games, Year, City, Country) %>%
summarise(total_medals = sum(n_complete(Medal), na.rm = TRUE))
## `summarise()` has grouped output by 'Games', 'Year', 'City'. You can override using the `.groups` argument.
medals_won
## # A tibble: 29 × 5
## # Groups: Games, Year, City [29]
## Games Year City Country total_medals
## <chr> <dbl> <chr> <chr> <int>
## 1 1896 Summer 1896 Athina Greece 3
## 2 1900 Summer 1900 Paris France 6
## 3 1904 Summer 1904 St. Louis United States 4
## 4 1908 Summer 1908 London United Kingdom 5
## 5 1912 Summer 1912 Stockholm Sweden 7
## 6 1920 Summer 1920 Antwerpen Belgium 3
## 7 1924 Summer 1924 Paris France 6
## 8 1928 Summer 1928 Amsterdam Netherlands 4
## 9 1932 Summer 1932 Los Angeles United States 5
## 10 1936 Summer 1936 Berlin Germany 1
## # … with 19 more rows
yearhost <- medals_won %>%
filter(Country == "Australia")
yearhost
## # A tibble: 2 × 5
## # Groups: Games, Year, City [2]
## Games Year City Country total_medals
## <chr> <dbl> <chr> <chr> <int>
## 1 1956 Summer 1956 Melbourne Australia 35
## 2 2000 Summer 2000 Sydney Australia 58
In year 1956 and 2000Australia host the Olympic game.
medals_won %>%
ggplot(aes(x = Year,
y = total_medals)) +
geom_line(aes()) +
geom_point(shape = 1, size = 3, colour = "red", data = yearhost) +
geom_label_repel(aes(label = City), color = "blue", data = yearhost) +
labs(y = "Total Medals",
title = "Number of medals won by Australia at Olympic games")
n_sex <- final_data2 %>%
group_by(Year, Sex) %>%
summarize(n_ath = sum(n())) # you might want to fill up multiple function here #3m
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
n_sex
## # A tibble: 55 × 3
## # Groups: Year [28]
## Year Sex n_ath
## <dbl> <chr> <int>
## 1 1896 M 380
## 2 1900 F 33
## 3 1900 M 1903
## 4 1904 F 16
## 5 1904 M 1285
## 6 1908 F 47
## 7 1908 M 3054
## 8 1912 F 87
## 9 1912 M 3953
## 10 1920 F 134
## # … with 45 more rows
ggplot(n_sex, aes(x = Year, y =n_ath,
group = Sex, colour = Sex)) +
geom_col(position = "fill")
I observed that participation rate between male and female are not equal, male participants are always more than female over the time from 1890 till now. But looking at the graph, female participants isincreased while male participants is decreased over the time , so it is unsure to say the male participants will still stay more than female in later years.
final_data2 %>%
group_by(Name, Team) %>%
summarize(n_medal = n_complete(Medal)) %>% # multiple functions are allowed here #2m
arrange(desc(n_medal)) %>%
top_n(20)
## `summarise()` has grouped output by 'Name'. You can override using the `.groups` argument.
## Selecting by n_medal
## # A tibble: 118,667 × 3
## # Groups: Name [115,524]
## Name Team n_medal
## <chr> <chr> <int>
## 1 "Michael Fred Phelps, II" United States 28
## 2 "Larysa Semenivna Latynina (Diriy-)" Soviet Union 18
## 3 "Nikolay Yefimovich Andrianov" Soviet Union 15
## 4 "Borys Anfiyanovych Shakhlin" Soviet Union 13
## 5 "Edoardo Mangiarotti" Italy 13
## 6 "Takashi Ono" Japan 13
## 7 "Aleksey Yuryevich Nemov" Russia 12
## 8 "Dara Grace Torres (-Hoffman, -Minas)" United States 12
## 9 "Jennifer Elisabeth \"Jenny\" Thompson (-Cumpelik)" United States 12
## 10 "Natalie Anne Coughlin (-Hall)" United States 12
## # … with 118,657 more rows
There are a lot more analysis that you can do, but I will stop here. Please explore the data on your own time 😇.
Let’s predict the number of medals won!
According to research, country GDP, population, home field advantage are the most common factors in predicting the medal won by the country.
world_pop <- read_excel("data/world_pop.xlsx")
world_gdp <- read_excel("data/world_gdp.xlsx", range = cell_limits(c(4,1),c(NA,NA)))
head(world_pop)
## # A tibble: 6 × 61
## Country `Country Code` `Indicator Name` `Indicator Code` `1960` `1961` `1962`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Aruba ABW Population, tot… SP.POP.TOTL 5.42e4 5.54e4 5.62e4
## 2 Afghani… AFG Population, tot… SP.POP.TOTL 9.00e6 9.17e6 9.35e6
## 3 Angola AGO Population, tot… SP.POP.TOTL 5.64e6 5.75e6 5.87e6
## 4 Albania ALB Population, tot… SP.POP.TOTL 1.61e6 1.66e6 1.71e6
## 5 Andorra AND Population, tot… SP.POP.TOTL 1.34e4 1.44e4 1.54e4
## 6 UAE ARE Population, tot… SP.POP.TOTL 9.26e4 1.01e5 1.12e5
## # … with 54 more variables: 1963 <dbl>, 1964 <dbl>, 1965 <dbl>, 1966 <dbl>,
## # 1967 <dbl>, 1968 <dbl>, 1969 <dbl>, 1970 <dbl>, 1971 <dbl>, 1972 <dbl>,
## # 1973 <dbl>, 1974 <dbl>, 1975 <dbl>, 1976 <dbl>, 1977 <dbl>, 1978 <dbl>,
## # 1979 <dbl>, 1980 <dbl>, 1981 <dbl>, 1982 <dbl>, 1983 <dbl>, 1984 <dbl>,
## # 1985 <dbl>, 1986 <dbl>, 1987 <dbl>, 1988 <dbl>, 1989 <dbl>, 1990 <dbl>,
## # 1991 <dbl>, 1992 <dbl>, 1993 <dbl>, 1994 <dbl>, 1995 <dbl>, 1996 <dbl>,
## # 1997 <dbl>, 1998 <dbl>, 1999 <dbl>, 2000 <dbl>, 2001 <dbl>, 2002 <dbl>, …
head(world_gdp)
## # A tibble: 6 × 61
## `Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960`
## <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 537777811
## 2 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NA
## 3 Algeria DZA GDP (current US$) NY.GDP.MKTP.CD 2723648552
## 4 American Samoa ASM GDP (current US$) NY.GDP.MKTP.CD NA
## 5 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NA
## 6 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NA
## # … with 56 more variables: 1961 <dbl>, 1962 <dbl>, 1963 <dbl>, 1964 <dbl>,
## # 1965 <dbl>, 1966 <dbl>, 1967 <dbl>, 1968 <dbl>, 1969 <dbl>, 1970 <dbl>,
## # 1971 <dbl>, 1972 <dbl>, 1973 <dbl>, 1974 <dbl>, 1975 <dbl>, 1976 <dbl>,
## # 1977 <dbl>, 1978 <dbl>, 1979 <dbl>, 1980 <dbl>, 1981 <dbl>, 1982 <dbl>,
## # 1983 <dbl>, 1984 <dbl>, 1985 <dbl>, 1986 <dbl>, 1987 <dbl>, 1988 <dbl>,
## # 1989 <dbl>, 1990 <dbl>, 1991 <dbl>, 1992 <dbl>, 1993 <dbl>, 1994 <dbl>,
## # 1995 <dbl>, 1996 <dbl>, 1997 <dbl>, 1998 <dbl>, 1999 <dbl>, 2000 <dbl>, …
Year, the column for population should be named pop and the column for gdp should be named gdp. [6m]world_pop1 <- world_pop %>%
pivot_longer(cols = -c("Country","Country Code", "Indicator Name", "Indicator Code"),
names_to = "Year",
values_to = "pop") %>%
clean_names() %>%
mutate(year = as.numeric(year))
head(world_pop1)
## # A tibble: 6 × 6
## country country_code indicator_name indicator_code year pop
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Aruba ABW Population, total SP.POP.TOTL 1960 54211
## 2 Aruba ABW Population, total SP.POP.TOTL 1961 55438
## 3 Aruba ABW Population, total SP.POP.TOTL 1962 56225
## 4 Aruba ABW Population, total SP.POP.TOTL 1963 56695
## 5 Aruba ABW Population, total SP.POP.TOTL 1964 57032
## 6 Aruba ABW Population, total SP.POP.TOTL 1965 57360
world_gdp1 <- world_gdp %>%
pivot_longer(cols = -c("Country Name","Country Code", "Indicator Name", "Indicator Code"),
names_to = "Year",
values_to = "gdp") %>%
clean_names() %>%
mutate(year = as.numeric(year))
head(world_gdp1)
## # A tibble: 6 × 6
## country_name country_code indicator_name indicator_code year gdp
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1960 537777811
## 2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1961 548888896
## 3 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1962 546666678
## 4 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1963 751111191
## 5 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1964 800000044
## 6 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 1965 1006666638
homefield, indicating 1 if the Team is the same as the hosted Country variable, else 0. Calculate the total medals won by each Team. [6m]final_data2 %>%
filter(Year %in% c(1992:2016)) %>%
mutate(homefield = ifelse(Team == Country, 1, 0)) %>%
select(Games, Year, Medal, Sport, Event, Country, City, Team, homefield) %>%
distinct() %>%
group_by(Year, Country, Team, homefield) %>%
summarise(total_medals = sum(n_complete(Medal), na.rm = TRUE)) -> final
## `summarise()` has grouped output by 'Year', 'Country', 'Team'. You can override using the `.groups` argument.
head(final)
## # A tibble: 6 × 5
## # Groups: Year, Country, Team [6]
## Year Country Team homefield total_medals
## <dbl> <chr> <chr> <dbl> <int>
## 1 1992 Spain Albania 0 0
## 2 1992 Spain Algeria 0 2
## 3 1992 Spain American Samoa 0 0
## 4 1992 Spain Andorra 0 0
## 5 1992 Spain Angola 0 0
## 6 1992 Spain Antigua and Barbuda 0 0
gdp and pop data are included in one data set. [4m]final %>%
left_join(world_gdp1 %>% select(year, country_name, gdp),
by = c("Team" = "country_name", "Year" = "year")) %>%
left_join(world_pop1 %>% select(year, country, pop),
by = c("Team" = "country", "Year" = "year")) %>%
mutate(gdp_m = gdp/1000000000,
pop_m = pop/1000000000) -> model_data
ggplot(model_data,
aes(x = total_medals)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The y variable appears to have a lot of zero values and it is a right-skewed data.
gdp_m, pop_m and homefield data. Interpret the coefficients of the model. [4m+ 3m]linear_model <- lm(total_medals ~ gdp_m + pop_m + homefield , data = model_data) # the first argument contains a long coding. # 3m
tidy(linear_model)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.15 0.260 8.26 4.10e-16
## 2 gdp_m 0.00898 0.000401 22.4 7.21e-92
## 3 pop_m 6.69 2.06 3.25 1.20e- 3
## 4 homefield 22.9 3.71 6.17 9.38e-10
The Y-variable and the other dependent variables are corresponding to each other accordingly. In positive coefficient, Country is expected to earn more total medals, on average, by 0.0090, for each additional $1000000000 GDP. Country is expected to earn more total medals, on average, by 6.6929, for each additional 1000000000 population. Country is expected to earn more total medals, on average, by 22.9147, if it is at athletes homefield.
glance(linear_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.442 0.440 8.18 288. 9.07e-138 3 -3857. 7724. 7749.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Roughly 44.03% of the variability in total_medals can be explained by gdp_m, pop_m and homefield. So the goodness of fit of the model is considered to be relative weak, since the explained sample in the model only below 50%.
# combine all data
model_data_complete <- augment(linear_model)
ggplot(data = model_data_complete,
aes(x = .fitted,
y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, colour = "red")
# oR
resid_panel(linear_model, plot = "all")
I observed that the model doesn’t really have residuals around the red line as a perfect fit model would have a horizontal line at zero. Also the residuals are kind of right skewed distributed.
Australia GDP: 1,359,330,000,000
Australia population as of 2020: 25,694,393
Calculation: total_medals = 2.147314278 + 0.008981092gdp_m + 6.692875882pop_m + 22.914676098homefield
total_medals^hat = 2.147314278 + 0.008981092* 1359330000000/1000000000 + 6.692875882* 25694393/1000000000 + 22.914676098* 0 = 14.52755
Comparison: In Olympic Games Tokyo 2020, Australia won 46 medals in total.
Compare with the predicted value from created regression model, the predicted value is differ with the actual data. Actual data is 46 medals but predicted value is around 15 medals.
Explanation: The predicted value is smaller than the predicted value by more than half, which indicates this regression model is not an appropriate model to predict medal earns in Olympics. The reason for the large difference between the predicted value and the actual data may be the lack of variables is added in, there should be others dependent variables which will affect the number of medal earns.
The original Olympics Games data was obtained from https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
The population and GDP data was obtained from https://www.gapminder.org/data/